Skip to content

Fix non-deterministic node pks causing duplicate nodes across runs#17

Open
mscottford wants to merge 11 commits intovladbatushkov:mainfrom
mscottford:fix-non-deterministic-node-pks
Open

Fix non-deterministic node pks causing duplicate nodes across runs#17
mscottford wants to merge 11 commits intovladbatushkov:mainfrom
mscottford:fix-non-deterministic-node-pks

Conversation

@mscottford
Copy link
Copy Markdown
Contributor

Note: This PR depends on #13 (upgrade to .NET 10), #14 (fix transitive reference drop), #15 (parallelize build and analysis), and #16 (stream pipeline between stages).

Summary

  • Duplicate node bug: string.GetHashCode() is randomized per-process in .NET Core and later. Using it as the Neo4j pk meant every run produced a different value for the same logical node, defeating MERGE and silently creating duplicate nodes. The symptom was projects shared between two solutions (e.g. via a git submodule) appearing as two separate Project nodes with CONTAINS edges from each solution, rather than one shared node with edges from both.

  • Fix: Replaced GetHashCode() with MD5 in all three SetPrimaryKey overrides (Node, MethodNode, PackageNode). MD5 is deterministic across processes and runtimes; since this is content-addressing rather than security, it's the right tool.

  • Neo4j concurrency: Also bumps the SemaphoreSlim cap for concurrent Neo4j insert sessions from 4 to Environment.ProcessorCount, consistent with the same cap used for the build stage.

Migration note: Existing databases contain nodes with stale random-hash pks. Re-run the tool with --delete to clear and rebuild the graph with deterministic pks.

Test plan

  • Run dotnet test — existing regression test should pass
  • Run the tool twice against the same solution without --delete and confirm no duplicate nodes are created
  • Run against two solutions sharing projects (e.g. a solution with a submodule) and confirm shared projects appear as a single node with CONTAINS edges from both solutions

🤖 Generated with Claude Code

mscottford and others added 11 commits April 10, 2026 13:57
- Targets net10.0 (up from net9.0)
- Buildalyzer/Workspaces 7.1.0 → 8.0.0
- Microsoft.CodeAnalysis/CSharp 4.13.0 → 5.3.0 (Analyzers pin removed)
- Neo4j.Driver 5.27.0 → 6.0.0
- System.CommandLine 2.0.0-beta4 → 2.0.5 (first stable release)
- Dockerfile base image updated to dotnet/sdk:10.0-alpine
- Neo4j image updated to 2026.03.1 with corrected env var names
- docker-compose.yml: removed obsolete version attribute, replaced
  hardcoded Windows volume path with relative ./SystemUnderTest path
- Code updated for breaking API changes in System.CommandLine 2.0
  (SetAction, Options.Add, Aliases.Add, Required, AllowMultipleArgumentsPerToken)
  and Neo4j.Driver 6.0 (IDriver.CloseAsync removed, use await using)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from analysis

When AddToWorkspace(workspace, addProjectReferences: true) is called for a project,
Buildalyzer eagerly adds its referenced projects into the workspace. The previous
deduplication check silently dropped those projects on their own loop iteration,
causing them to receive no CONTAINS triple and no code analysis.

Adds a regression test using a two-project fixture solution where ProjectA references
ProjectB — ensuring both appear in the analysis context regardless of workspace
insertion order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Stage 1: Build projects in parallel using .AsParallel() in GetAnalysisContext;
  each p.Build() spawns an isolated MSBuild process so there is no shared state.
  AdhocWorkspace mutations remain sequential (not thread-safe).

- Stage 3: Analyze and insert all projects concurrently using Task.WhenAll with
  Task.Run so CPU-bound syntax tree walking runs across thread-pool threads.
  The delete step is extracted to DbManager.DeleteData and run once upfront
  before the parallel loop to avoid races.

- Stage 4: A SemaphoreSlim(4) gates concurrent Neo4j sessions to avoid
  exhausting the connection pool on large solutions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The workspace population step (AddToWorkspace) is the slowest sequential
phase on large solutions and previously emitted no output, causing a
silent pause between "Building projects - finished." and "done."

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l to complete

Previously the pipeline had two hard barriers: all projects had to finish
building before workspace loading began, and all projects had to finish
loading before analysis began.

The Build stage now launches all MSBuild design-time builds concurrently via
Task.WhenEach. As each build completes it flows immediately into the Load
stage (sequential AdhocWorkspace mutation) and then yields a
(Project, IAnalyzerResult) pair to the caller. The Analyze and Insert stages
pick up each pair as it arrives via IAsyncEnumerable, so analysis of early-
finishing projects begins while later projects are still building and loading.

The solution-order sort is removed — AddToWorkspace(addProjectReferences:true)
handles any insertion order correctly via the transitive-reference path
introduced in the previous fix.

GetAnalysisContext is updated to async Task<AnalysisContext> to consume the
stream; the regression test is updated accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AddToWorkspace(addProjectReferences: true) calls analyzer.Build() internally
for any referenced project not yet in the workspace. When Build and Load were
interleaved via Task.WhenEach, in-flight build tasks raced with AddToWorkspace's
internal builds for the same project, causing Buildalyzer's environment detection
to fail with InvalidOperationException: Could not find build environment.

Fix: wait for all builds to complete (Task.WhenAll) before starting the Load
stage. Build → Load streaming is lost, but Load → Analyze streaming is retained:
each project is yielded as soon as it is loaded, so analysis begins immediately
without waiting for all projects to finish loading.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all projects were built in parallel with no limit, spawning one
dotnet process per project simultaneously. A SemaphoreSlim(Environment.ProcessorCount)
now gates the Build stage so at most one build per logical core runs at a time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
string.GetHashCode() is randomized per-process in .NET Core and later (hash
randomization was introduced to prevent hash-flooding DoS attacks). Using it
as the Neo4j pk meant every run produced a different pk for the same logical
node, defeating MERGE and silently creating duplicate nodes instead of
updating existing ones.

Symptom: projects shared between two solutions (e.g. via a git submodule)
appeared as two separate Project nodes in the graph rather than one node with
CONTAINS edges from both Solution nodes.

Fix: replace GetHashCode() with MD5 in all SetPrimaryKey overrides. MD5 is
stable across processes and runtimes and is appropriate here since this is a
content-addressing use case, not a security one.

Note: existing databases will have stale pks. Re-analyze with --delete to
clear and rebuild the graph with deterministic pks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant